Student information¶

Raúl van Harmelen: 14665387

Elianto Dijk: 14434091

Fardeen Wieliams: 13770160

Sharynho Markelo: 14579073

Importing the relevant Python libaries¶

In [1]:
import csv
import networkx as nx
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import preprocessing
import statsmodels

Table of contents¶

  1. Introduction
  2. Datasets and preprocessing
  3. Who knows whom? - Degree distribution
  4. Small world property
  5. Bigger network
  6. Friendships in communities - Triadic closure
  7. Conclusion
  8. Reflection
  9. Work distribution
  10. References
  11. Appendix

1. Introduction¶

Have you ever wondered how people interact with each other? How do these interactions differ online and offline? These types of questions concern network science. A network is a web that consists of nodes and edges. People, devices, accounts or other objects can be represented as nodes and the edges can be represented as the relationships between these nodes. All these nodes and edges together form a network. Network science studies the relationships between these nodes. In this project, different social networks will be analysed. The following two perspectives in this project are:

  1. Online social networks based on real life social interaction are likely to resemble real networks.
  2. Online social networks without real life social interaction are likely to resemble random networks.

In order to analyse both perspectives, different network properties are calculated. Leaping into the future, we were able to successfully compute all these different network properties. From network science it appears that networks emerged from real social interactions have different characteristics compared to random networks. A distinction must be made between random networks and real networks. According to Barabasi (2016), a random network consists of N nodes where each node pair is connected with probability p. Because a random network is constructed by chance, the characteristics are typical for this randomness. Barabasi (2016) also explains that real networks have certain characteristics since that cannot be explained by random network models. The reason for this is that real networks are not constructed by chance, but are a result of social interactions which cannot be reacreated by computers.

If online social interactions form networks similar to real networks, then online social interactions are similar to real social interactions. If not, it means social interactions are different from real interactions.

2. Datasets and preprocessing¶

This project is based on three different datasets of social networks that can be found on Kaggle (Appendix). Each dataset is a representation of a network in the form of a list of edges. To some datasets there is also a file added with node attributes. The latest dataset is slightly different from the first two, because it is a dataset containing multiple datasets of social networks.

Before the datasets can be read in, it is necessary to manually select which network datasets will and will not be selected for the project. This selection is based on the size of the datasets to reduce the risk of very long processing times. Besides, any not connected network is excluded from the project to simplify calculation tasks. The table below is an overview of all the different networks, retrieved from the datasets, that will be used in this project.

Dataset Node representation Link representation Node attributes if applicable
NashvilleMeetupNetwork Member of a Meetup group Shared group membership in 'weight' groups N/A
DeezerHR Deezer users from Croatia The relation friendship Genre preferences of each user
DeezerHU Deezer users from Hungary The relation friendship Genre preferences of each user
DeezerRO Deezer users from Romania The relation friendship Genre preferences of each user
FacebookLargePage Official Facebook pages The amount of mutual likes between pages Descriptions of the purpose of the site
FeatherDeezerSocial Deezer users from Europe The relation friendship Artists liked by the users
FeatherLastfmSocial Lastfm users from Asia The relation friendship Artists liked by the users
GithubSocial Developers who have starred at least 10 repositories Mutual follwing between developers Location, Repositories starred, Employer and E-mail address
TwitchSocialNetworks Twitch users The relation friendship Games played and liked, Location and Streaming habits
TwitchSocialNetworksDE Twitch users from Germany The relation friendship Games played and liked, Location and Streaming habits
TwitchSocialNetworksENGB Twitch users from England The relation friendship Games played and liked, Location and Streaming habits
TwitchSocialNetworksES Twitch users from Spain The relation friendship Games played and liked, Location and Streaming habits
TwitchSocialNetworksFR Twitch users from France The relation friendship Games played and liked, Location and Streaming habits
TwitchSocialNetworksPTBR Twitch users from Brazil The relation friendship Games played and liked, Location and Streaming habits
TwitchSocialNetworksRU Twitch users from Russia The relation friendship Games played and liked, Location and Streaming habits
SocEpinions1 Epinion users Trust relationship between users N/A
SocSignSlashdot090221 Slashdot users 2009 Friend or enemy relationship N/A
SocSlashdot0902 Slashdot users 2009 Friend or enemy relationship N/A

Preprocessing non-csv datasets¶

Some datasets of networks have their edges saved in a .txt file while others in a .csv file. In order to read the edges with the pandas read_csv function those .txt files need to be converted to .csv files. A function is used to accomplish this that can be found in the preprocessing.py file.

  • soc-sign-Slashdot090221.txt
  • soc-Slashdot0902.txt
In [2]:
# preprocessing.transform_text_to_csv('datasets/soc-sign-Slashdot090221.txt', 'datasets/soc-sign-Slashdot090221.csv')
# preprocessing.transform_text_to_csv('datasets/soc-Slashdot0902.txt', 'datasets/soc-Slashdot0902.csv')

Read in datasets with pandas¶

The datasets can be easily read in with pandas.

In [3]:
NashvilleMeetupNetwork_member_edges = pd.read_csv('datasets/NashvilleMeetupNetwork/member-edges.csv', index_col=0)
DeezerHR_edges = pd.read_csv('datasets/DeezerSocialNetworks/HR/HR_edges.csv')
DeezerHU_edges = pd.read_csv('datasets/DeezerSocialNetworks/HU/HU_edges.csv')
DeezerRO_edges = pd.read_csv('datasets/DeezerSocialNetworks/RO/RO_edges.csv')
FacebookLargePage_edges = pd.read_csv('datasets/facebook-large-page-page-network/musae_facebook_edges.csv')
FeatherDeezerSocial_edges = pd.read_csv('datasets/feather-deezer-social/deezer_europe_edges.csv')
FeatherLastfmSocial_edges = pd.read_csv('datasets/feather-lastfm-social/lastfm_asia_edges.csv')
GithubSocial_edges = pd.read_csv("datasets/github-social/musae_git_edges.csv")
TwitchSocialNetworksDE_edges = pd.read_csv("datasets/twitch-social-networks/DE/musae_DE_edges.csv")
TwitchSocialNetworksENGB_edges = pd.read_csv("datasets/twitch-social-networks/ENGB/musae_ENGB_edges.csv")
TwitchSocialNetworksES_edges = pd.read_csv("datasets/twitch-social-networks/ES/musae_ES_edges.csv")
TwitchSocialNetworksFR_edges = pd.read_csv("datasets/twitch-social-networks/FR/musae_FR_edges.csv")
TwitchSocialNetworksPTBR_edges = pd.read_csv("datasets/twitch-social-networks/PTBR/musae_PTBR_edges.csv")
TwitchSocialNetworksRU_edges = pd.read_csv("datasets/twitch-social-networks/RU/musae_RU_edges.csv")
SocSignSlashdot090221_edges = pd.read_csv('datasets/soc-sign-Slashdot090221.csv')
SocSlashdot0902_edges = pd.read_csv('datasets/soc-Slashdot0902.csv')

Construct graphs with networkx¶

After that, undirected networkx graphs can be constructed from the pandas dataframes.

In [4]:
NashvilleMeetupNetwork = nx.from_pandas_edgelist(NashvilleMeetupNetwork_member_edges, 'member1', 'member2', edge_attr='weight', create_using=nx.Graph)
DeezerHR = nx.from_pandas_edgelist(DeezerHR_edges, 'node_1', 'node_2', create_using=nx.Graph)
DeezerHU = nx.from_pandas_edgelist(DeezerHU_edges, 'node_1', 'node_2', create_using=nx.Graph)
DeezerRO = nx.from_pandas_edgelist(DeezerRO_edges, 'node_1', 'node_2', create_using=nx.Graph)
FacebookLargePage = nx.from_pandas_edgelist(FacebookLargePage_edges, 'id_1', 'id_2', create_using=nx.Graph)
FeatherDeezerSocial = nx.from_pandas_edgelist(FeatherDeezerSocial_edges, 'node_1', 'node_2', create_using=nx.Graph)
FeatherLastfmSocial = nx.from_pandas_edgelist(FeatherLastfmSocial_edges, 'node_1', 'node_2', create_using=nx.Graph) 
GithubSocial = nx.from_pandas_edgelist(GithubSocial_edges, 'id_1', 'id_2', create_using=nx.Graph)
TwitchSocialNetworksDE = nx.from_pandas_edgelist(TwitchSocialNetworksDE_edges, 'from', 'to', create_using=nx.Graph)
TwitchSocialNetworksENGB = nx.from_pandas_edgelist(TwitchSocialNetworksENGB_edges, 'from', 'to', create_using=nx.Graph)
TwitchSocialNetworksES = nx.from_pandas_edgelist(TwitchSocialNetworksES_edges, 'from', 'to', create_using=nx.Graph)
TwitchSocialNetworksFR = nx.from_pandas_edgelist(TwitchSocialNetworksFR_edges, 'from', 'to', create_using=nx.Graph)
TwitchSocialNetworksPTBR = nx.from_pandas_edgelist(TwitchSocialNetworksPTBR_edges, 'from', 'to', create_using=nx.Graph)
TwitchSocialNetworksRU = nx.from_pandas_edgelist(TwitchSocialNetworksRU_edges, 'from', 'to', create_using=nx.Graph)
SocSignSlashdot090221 = nx.from_pandas_edgelist(SocSignSlashdot090221_edges, '# FromNodeId', 'ToNodeId', edge_attr='Sign' , create_using=nx.Graph)
SocSlashdot0902 = nx.from_pandas_edgelist(SocSlashdot0902_edges, '# FromNodeId', 'ToNodeId', create_using=nx.Graph)

Destriptive statistics of different networks¶

The final part of the preprocessing part consists of calculating different network properties for each network. The visualizations section explains what the different properties mean in more detail.

In [5]:
networks = {
    'NashvilleMeetupNetwork' : NashvilleMeetupNetwork,
    'DeezerHR' : DeezerHR,
    'DeezerHU' : DeezerHU,
    'DeezerRO' : DeezerRO,
    'FacebookLargepage' : FacebookLargePage,
    'FeatherDeezerSocial' : FeatherDeezerSocial,
    'FeatherLastfmSocial' : FeatherLastfmSocial, 
    'GithubSocial' : GithubSocial,
    'TwitchSocialNetworksDE' : TwitchSocialNetworksDE,
    'TwitchSocialNetworksENGB' : TwitchSocialNetworksENGB,
    'TwitchSocialNetworksES' : TwitchSocialNetworksES,
    'TwitchSocialNetworksFR' : TwitchSocialNetworksFR,
    'TwitchSocialNetworksPTBR' : TwitchSocialNetworksPTBR,
    'TwitchSocialNetworksRU' : TwitchSocialNetworksRU,
    'SocSignSlashdot090221' : SocSignSlashdot090221,
    'SocSlashdot0902' : SocSlashdot0902,
}

The code below uses networkx and will take some time to run. That is why the results are written to a CSV-file that can be read in later.

In [6]:
# start = time.time()
# descriptive_stats = pd.DataFrame(index=list(networks.keys()), columns=['number_of_nodes', 'number_of_edges', 'average_degree', 'density', 'connected_network', 'avg_cc', 'transitivity', 'average_shortest_path_length', 'diameter', 'has_bridges', 'number_of_bridges', 'max_degree'])
# for name, network in networks.items():
#     descriptive_stats['number_of_nodes'].loc[name] = nx.number_of_nodes(network)
#     descriptive_stats['number_of_edges'].loc[name] = nx.number_of_edges(network)
#     descriptive_stats['average_degree'].loc[name] = np.mean(list(dict(network.degree()).values()))
#     descriptive_stats['density'].loc[name] = nx.density(network)
#     descriptive_stats['connected_network'].loc[name] = nx.is_connected(network)
#     descriptive_stats['avg_cc'].loc[name] = nx.average_clustering(network)
#     descriptive_stats['transitivity'].loc[name] = nx.transitivity(network)
#     descriptive_stats['has_bridges'].loc[name] = nx.has_bridges(network)
#     descriptive_stats['number_of_bridges'].loc[name] = len(list(nx.bridges(network)))
#     descriptive_stats['max_degree'].loc[name] = max(dict(nx.degree(network)).values())
    
# descriptive_stats.to_csv('results/descriptive_stats_of_networks_2.csv')
# end = time.time()
# print(f'tijd in seconden:{end - start}')

Networkx can be used to calculate many different network properties as seen above. However, some network properties require more efficient algorithms to compute.

Below this cell all the code can be found that was used to calculated additional network properties of the network. This code was run in a google Colab notebook because this allows us to use Google's GPU for gpu accelerated computing. Instead of pandas and networkx, the python libaries cudf and cugraph were used.

In [7]:
# import cudf
# import cugraph as cnx
# import time
# import numpy as np
# import json
# import itertools

# # Reading in the data with cudf
# NashvilleMeetupNetwork_member_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/NashvilleMeetupNetwork/member-edges.csv', index_col=0)
# DeezerHR_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/DeezerSocialNetworks/HR/HR_edges.csv')
# DeezerHU_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/DeezerSocialNetworks/HU/HU_edges.csv')
# DeezerRO_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/DeezerSocialNetworks/RO/RO_edges.csv')
# FacebookLargePage_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/facebook-large-page-page-network/musae_facebook_edges.csv')
# FeatherDeezerSocial_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/feather-deezer-social/deezer_europe_edges.csv')
# FeatherLastfmSocial_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/feather-lastfm-social/lastfm_asia_edges.csv')
# GithubSocial_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/github-social/musae_git_edges.csv")
# TwitchSocialNetworksDE_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/twitch-social-networks/DE/musae_DE_edges.csv")
# TwitchSocialNetworksENGB_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/twitch-social-networks/ENGB/musae_ENGB_edges.csv")
# TwitchSocialNetworksES_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/twitch-social-networks/ES/musae_ES_edges.csv")
# TwitchSocialNetworksFR_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/twitch-social-networks/FR/musae_FR_edges.csv")
# TwitchSocialNetworksPTBR_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/twitch-social-networks/PTBR/musae_PTBR_edges.csv")
# TwitchSocialNetworksRU_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/twitch-social-networks/RU/musae_RU_edges.csv")
# SocSignSlashdot090221_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/soc-sign-Slashdot090221.csv')
# SocSlashdot0902_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/soc-Slashdot0902.csv')

# # Constructing networks with cugraph
# NashvilleMeetupNetwork = cnx.from_cudf_edgelist(NashvilleMeetupNetwork_member_edges, 'member1', 'member2', edge_attr='weight', create_using=cnx.Graph)
# DeezerHR = cnx.from_cudf_edgelist(DeezerHR_edges, 'node_1', 'node_2', create_using=cnx.Graph)
# DeezerHU = cnx.from_cudf_edgelist(DeezerHU_edges, 'node_1', 'node_2', create_using=cnx.Graph)
# DeezerRO = cnx.from_cudf_edgelist(DeezerRO_edges, 'node_1', 'node_2', create_using=cnx.Graph)
# FacebookLargePage = cnx.from_cudf_edgelist(FacebookLargePage_edges, 'id_1', 'id_2', create_using=cnx.Graph)
# FeatherDeezerSocial = cnx.from_cudf_edgelist(FeatherDeezerSocial_edges, 'node_1', 'node_2', create_using=cnx.Graph)
# FeatherLastfmSocial = cnx.from_cudf_edgelist(FeatherLastfmSocial_edges, 'node_1', 'node_2', create_using=cnx.Graph)
# GithubSocial = cnx.from_cudf_edgelist(GithubSocial_edges, 'id_1', 'id_2', create_using=cnx.Graph)
# TwitchSocialNetworksDE = cnx.from_cudf_edgelist(TwitchSocialNetworksDE_edges, 'from', 'to', create_using=cnx.Graph)
# TwitchSocialNetworksENGB = cnx.from_cudf_edgelist(TwitchSocialNetworksENGB_edges, 'from', 'to', create_using=cnx.Graph)
# TwitchSocialNetworksES = cnx.from_cudf_edgelist(TwitchSocialNetworksES_edges, 'from', 'to', create_using=cnx.Graph)
# TwitchSocialNetworksFR = cnx.from_cudf_edgelist(TwitchSocialNetworksFR_edges, 'from', 'to', create_using=cnx.Graph)
# TwitchSocialNetworksPTBR = cnx.from_cudf_edgelist(TwitchSocialNetworksPTBR_edges, 'from', 'to', create_using=cnx.Graph)
# TwitchSocialNetworksRU = cnx.from_cudf_edgelist(TwitchSocialNetworksRU_edges, 'from', 'to', create_using=cnx.Graph)
# SocSignSlashdot090221 = cnx.from_cudf_edgelist(SocSignSlashdot090221_edges, '# FromNodeId', 'ToNodeId', edge_attr='Sign' , create_using=cnx.Graph)
# SocSlashdot0902 = cnx.from_cudf_edgelist(SocSlashdot0902_edges, '# FromNodeId', 'ToNodeId', create_using=cnx.Graph)

# networks = {
#     'NashvilleMeetupNetwork' : NashvilleMeetupNetwork,
#     'DeezerHR' : DeezerHR,
#     'DeezerHU' : DeezerHU,
#     'DeezerRO' : DeezerRO,
#     'FacebookLargepage' : FacebookLargePage,
#     'FeatherDeezerSocial' : FeatherDeezerSocial,
#     'FeatherLastfmSocial' : FeatherLastfmSocial,
#     'GithubSocial' : GithubSocial,
#     'TwitchSocialNetworksDE' : TwitchSocialNetworksDE,
#     'TwitchSocialNetworksENGB' : TwitchSocialNetworksENGB,
#     'TwitchSocialNetworksES' : TwitchSocialNetworksES,
#     'TwitchSocialNetworksFR' : TwitchSocialNetworksFR,
#     'TwitchSocialNetworksPTBR' : TwitchSocialNetworksPTBR,
#     'TwitchSocialNetworksRU' : TwitchSocialNetworksRU,
#     'SocSignSlashdot090221' : SocSignSlashdot090221,
#     'SocSlashdot0902' : SocSlashdot0902,
# }

# # Code used to compute the average shortest path length and diameter
# start = time.time()
# descriptive_stats = cudf.DataFrame(index=list(networks.keys()), columns=['average_shortest_path_length', 'diameter'])
# for name, network in networks.items():
#     shortest_path_lengths = [cnx.bfs_edges(network, source=x)['distance'].max() for x in network.nodes().to_numpy()]
#     descriptive_stats['average_shortest_path_length'].loc[name] = sum(shortest_path_lengths)/len(shortest_path_lengths)
#     descriptive_stats['diameter'].loc[name] = max(shortest_path_lengths)
#     descriptive_stats.to_csv('/content/drive/MyDrive/Colab Notebooks/results/descriptive_stats_of_networks.csv')
# end = time.time()
# print(f'tijd in seconden:{end - start}')

Combining the calculatings from networkx and cugraph gives the following dataframe:

In [8]:
descriptive_stats = pd.read_csv('results/descriptive_stats_of_networks.csv', index_col=0)
descriptive_stats
Out[8]:
number_of_nodes number_of_edges average_degree density connected_network avg_cc transitivity average_shortest_path_length diameter has_bridges number_of_bridges max_degree
NashvilleMeetupNetwork 11372 1176368 206.888498 0.018194 True 0.884957 0.604407 4.413648 6 True 13 2356
DeezerHR 54573 498202 18.258186 0.000335 True 0.136477 0.114630 8.501347 12 True 2435 420
DeezerHU 47538 222887 9.377214 0.000197 True 0.116187 0.092924 9.930603 14 True 2869 112
DeezerRO 41773 125826 6.024274 0.000144 True 0.091212 0.075267 12.970340 19 True 6403 112
FacebookLargepage 22470 171002 15.220472 0.000677 True 0.359738 0.232321 10.443525 15 True 2973 709
FeatherDeezerSocial 28281 92752 6.559315 0.000232 True 0.141160 0.095922 15.138963 21 True 6470 172
FeatherLastfmSocial 7624 27806 7.294334 0.000957 True 0.219418 0.178623 10.457240 15 True 1929 216
GithubSocial 37700 289003 15.331724 0.000407 True 0.167537 0.012357 7.228674 11 True 5241 9458
TwitchSocialNetworksDE 9498 153138 32.246368 0.003395 True 0.200886 0.046471 5.085176 7 True 453 4259
TwitchSocialNetworksENGB 7126 35324 9.914117 0.001391 True 0.130928 0.042433 7.001684 10 True 1222 720
TwitchSocialNetworksES 4648 59382 25.551635 0.005499 True 0.222496 0.084235 6.649312 9 True 301 1022
TwitchSocialNetworksFR 6549 112666 34.407085 0.005255 True 0.221706 0.054128 5.089327 7 True 257 2040
TwitchSocialNetworksPTBR 1912 31299 32.739540 0.017132 True 0.319895 0.130981 4.905335 7 True 116 767
TwitchSocialNetworksRU 4385 37304 17.014367 0.003881 True 0.165797 0.048648 6.288940 9 True 461 1229
SocSignSlashdot090221 82140 500481 12.186048 0.000148 True 0.058787 0.023618 8.911346 13 True 30086 2548
SocSlashdot0902 82168 582533 14.179072 0.000173 True 0.060345 0.024109 8.910720 13 True 30035 2554

3. Who knows whom? - Degree distribution¶

Degree distribution of networks.

Firstly, we will look at the amount of acquaintances of a person in a network. The degree is the number of connections (or neighbors) a node has. A node with a high degree has connections with many other nodes and is often refferd to as a hub. A node with a low degree has links with only a few other nodes.

The degree distribution gives insight into the probability that a randomly selected node in the network has a certain degree. It is known that a random network has a binomial degree distribution (Barabasi, 2016). A real network, on the other hand has a degree distribution close to exponetional decline. By analysing the degree distribution of the different networks, we can take the first steps into categorizing the networks into real or random networks. If the distribution of a network is binomial, then the network is better described by the random network model. If not, it's more likely to be a real network.

In [9]:
fig = go.Figure()

#Add traces, one for each slider step
for step in np.arange(0.1, 1, 0.1):
    random_network = nx.gnp_random_graph(1500, step)
    degrees = [degree for _, degree in random_network.degree()]
    degree_counts = nx.degree_histogram(random_network)
    fraction_of_nodes = [count / len(random_network) for count in degree_counts]
    fig.add_trace(go.Histogram(x=degrees, y=fraction_of_nodes, visible=False))

# Make 0th trace visible
fig.data[0].visible = True

# Create and add slider
steps = []
for i in range(len(fig.data)):
    step = dict(
        method="update",
        args=[{"visible": [False] * len(fig.data)},
              {"title": "Degree Distribution of a random network with probability: " + f"0.{i+1} (Figure 1)"}],  # layout attribute
    )
    step["args"][0]["visible"][i] = True  # Toggle i'th trace to "visible"
    steps.append(step)

sliders = [dict(
    active=10,
    currentvalue={"prefix": "Probability: "},
    pad={"t": 50},
    steps=steps
)]

fig.update_layout(
    title='Degree Distribution of a random network with probability: 0.1 (Figure 1)',
    xaxis_title='Degree',
    yaxis_title='Number of nodes with degree',
    bargap = 0.01,
    sliders=sliders
)
fig.show()

As can be seen above, a randomly generated network has a binomial structure, which means that most nodes have the same number of links and there are no or a few highly connected nodes. This is common for random networks. Next up we'll look at the networks that we're using and see if they're a random network with a binomial structure, or a real network with an exponential decline.

In [10]:
degree_dist = go.Figure()
for network_name, network in networks.items():
    degree_sequence = sorted((d for n, d in network.degree()), reverse=True)
    num_nodes = len(degree_sequence)
    fraction_nodes = np.arange(num_nodes) / num_nodes
    degree_dist.add_trace(go.Scatter(x=degree_sequence,
                                     y=fraction_nodes,
                                     mode='lines',
                                     name=network_name))

degree_dist.update_layout(
    title="Degree distribution of all networks (Figure 2)",
    xaxis_title="Degree",
    yaxis_title="Fraction of Nodes",
    xaxis_type="log",
)

degree_dist.show()

As the above chart shows, the degree distribution of all the networks are far from binomial degree distribution of random networks. We can see there are many nodes with only a few links and a few hubs with many links. The x-axis is even made logarithmic to account for the exponential distribution of the degree. Thus, considering the degree distribution, all observed networks are most likely real networks. By clicking on the names in the index, you can eliminate certain networks from the graph.

4. Small world property¶

Only few steps needed to reach any arbitrary node in the network.

Another property of real networks is that they tend to be small. What this means becomes clear after introducting the concept distance.

If it is possible to follow links to go from one node to another node in the network, it could be said that there is a path between two nodes. The path length is defined as the number of links in a path. Between the same nodes, there can be multiple paths. Therefore, the path with the least number of links is called the shortest path. With this definition, the distance between two nodes can be defined as the length of the shortest path between nodes, also called the shortest path length.

In order to investigated how close or far we expect nodes to be in a network network scientists look at the distances in a network. Subsequently, if distances are short in a network, a network is considered small. The average shortest path length, which is calculated by averaging the shortest path lengths across all pairs of nodes, can give insight into this.

$a =\sum_{\substack{s,t \in V \\ s\neq t}} \frac{d(s, t)}{n(n-1)}$

where V is the set of nodes in G, d(s, t) is the shortest path from s to t, and n is the number of nodes in G.

Another network property related to shortest paths, is the diameter, defined as the maximum shortest path length across all pairs of nodes.

Small world networks

Barabasi (2016) mentions another type of network introduced by Watts and Strogatz in 1998. A small world networks is a real networks with very short path lengths between nodes in a network. In other words, two individuals anywhere in the world can be connected through a few acquaintances. According to Menczer et al. (2020), the average path lengths of small world networks scale logarithmically with the size of the network.

$a \approx log(n)$

where n is the number of nodes in G.

In [11]:
swp = px.scatter(descriptive_stats,
                 x="number_of_nodes",
                 y="average_shortest_path_length",
                 size = "diameter",
                 trendline= "ols",
                 trendline_options=dict(log_x=True),
                 color = "diameter",
                 title="Relation between average path length and number of nodes (Figure 3)",
                 hover_name = descriptive_stats.index)
swp.update_layout(
    xaxis = dict(
        title = 'Amount of nodes'),
    yaxis = dict(
        title = 'Average (shortest) path length')
    )
swp.show()

According to Barabasi (2016), the small-world property is the only property that can be explained by the random network model. Thus, with this plot, there is no way to clearly distinguish whether the networks are real or random. However, the takeaways are firstly that for some networks the average path length indeed scales logarithmically with the size of the network. Secondly, there are only a handful of networks with a short average path lengths compared to the number of nodes in that network. Unsurprisingly these are also the networks with the shortest diameter. Thirdly, this tells us something about the way people interact with each other in different networks. In some networks the interactions create highly clustered networks, resulting in a lower average path length. In other networks the interactions create lower clustered networks resulting, in a lower average path length.

5. Bigger network¶

The effect of size on random networks.

Another way to analyse the randomness of our networks, is by checking the effect of their size on the average clustering coefficient. Barabasi (2016) states that in a random network, the average clustering coefficient is independent of the average degree but depends on the network size. In contrast to random networks, the average clustering coefficient of real networks decrease by the average degree and does not depend on the network size (Barabasi, 2016). By plotting a multivariate visualization that contains these three variables, we might be able to find a trend in our dataset that concerns the previous statement.

In [12]:
bubble = px.scatter(descriptive_stats,
                    x = 'average_degree', 
                    y = 'avg_cc', size = 'number_of_nodes', 
                    labels = list(descriptive_stats.index), 
                    hover_name = descriptive_stats.index, 
                    color = descriptive_stats.index, 
                    log_x = True, size_max = 60) 
bubble.update_layout( title = 'Average clustering coefficient per average degree (Figure 4)', 
                     xaxis = dict( title = 'Average degree', 
                                  gridcolor = 'white', 
                                  type = 'log', 
                                  gridwidth = 2, ), 
                     yaxis = dict( title = 'Average clustering coefficient', 
                                  gridcolor = 'white', 
                                  gridwidth = 2, ), 
                     paper_bgcolor = 'rgb(243, 243, 243)', 
                     plot_bgcolor = 'rgb(243, 243, 243)', ) 
bubble.show()

This bubble chart shows that the smaller networks generally have a higher average clustering coefficient. However, Nashville doesn't seem to follow the same trend as the other networks due to it's distance and size. Consequently, Nashville resembles a real network as it's average clustering coefficient doesn't seem to be affected by it's size. This can be better seen in the graph by removing Nashville from the graph, which can be done by clicking on Nashville in the index.

6. Friendships in communities - Triadic closure¶

Meet new friends through shared contacts.

To go futher into the interactions that create networks, we look at triadic closure. Triadic closure describes the phenomenon that people tend to meet new friends through shared contacts. It is called triadic closure because in a network graph a triangle is closed when someone meets a friend of a friend. The clustering coefficient describes how interconnected the neighbors of a particular node are with each other. Or in other words the fraction of the nodes possible triangles that are actually closed.

$$ c_u = \frac{2 T(u)}{deg(u)(deg(u)-1)}, $$

whereT(u) is the number of triangles through node u and deg(u) is the degree of u.

The average clustering coefficient of a network is defined as the clustering coefficient over all the nodes. According to Barabasi (2016) real networks with a small world property tend to have a high average clustering coefficient.

In [13]:
#colors = px.colors.qualitative.Plotly[:len(networks)]
#"blue", "blue", "blue", "blue", "blue", "blue", "blue", "blue", "blue", "blue", "blue", "blue", "blue", "blue","blue"
avg_cc = px.bar(descriptive_stats, 
                x=list(networks.keys()), 
                y='avg_cc', 
                color = list(networks.keys()),
                color_discrete_sequence= ["#1302ab","#7793f7","#7793f7","#7793f7","#7793f7","#7793f7","#7793f7","#7793f7","#7793f7","#7793f7","#7793f7","#7793f7","#7793f7","#7793f7","#7793f7","#7793f7"], 
                title='Nashville Network clearly has a significant higher clustering coefficient (Figure 5)')
avg_cc.update_layout(xaxis_title='Networks', yaxis_title='Average clustering coefficient', showlegend=False)
avg_cc.show()

The takeaway of this bar chart is that Nashville resembles a real social network due to it's significantly higher average clustering coefficient in comparison to the other networks.

What clustering means for real networks¶

Clustering tells us that many nodes have common neighbors. According to Barabasi (2016) in a real network, high degree nodes tend to have a smaller clustering coefficient than low degree nodes. The more neighbors you have, the harder it is for your neighbors to know the same neighbors as you. This relationship is absent in random networks. Since Nashville Meetup Network is the network with the highest average clustering coefficient, this network is chosen first to plot.

In order to minimize the loading time of the jupyter notebook, the computations are done in advance and written to a csv file, which then can be loaded in.

In [14]:
networks_with_degree_cc_file_path = {
    'NashvilleMeetupNetwork' : 'results/NashvilleMeetupNetwork_degree_vs_cc.csv',
    'DeezerHR' : 'results/DeezerHR_degree_vs_cc.csv',
    'DeezerHU' : 'results/DeezerHU_degree_vs_cc.csv',
    'DeezerRO' : 'results/DeezerRO_degree_vs_cc.csv',
    'FacebookLargepage' : 'results/FacebookLargepage_degree_vs_cc.csv',
    'FeatherDeezerSocial' : 'results/FeatherDeezerSocial_degree_vs_cc.csv',
    'FeatherLastfmSocial' : 'results/FeatherLastfmSocial_degree_vs_cc.csv', 
    'GithubSocial' : 'results/GithubSocial_degree_vs_cc.csv',
    'TwitchSocialNetworksDE' : 'results/TwitchSocialNetworksDE_degree_vs_cc.csv',
    'TwitchSocialNetworksENGB' : 'results/TwitchSocialNetworksENGB_degree_vs_cc.csv',
    'TwitchSocialNetworksES' : 'results/TwitchSocialNetworksES_degree_vs_cc.csv',
    'TwitchSocialNetworksFR' : 'results/TwitchSocialNetworksFR_degree_vs_cc.csv',
    'TwitchSocialNetworksPTBR' : 'results/TwitchSocialNetworksPTBR_degree_vs_cc.csv',
    'TwitchSocialNetworksRU' : 'results/TwitchSocialNetworksRU_degree_vs_cc.csv',
    'SocSignSlashdot090221' : 'results/SocSignSlashdot090221_degree_vs_cc.csv',
    'SocSlashdot0902' : 'results/SocSlashdot0902_degree_vs_cc.csv',
}

def write_cc_and_degree_to_csv(graph, file_path):
    """Computes cc and degree for each node and writes it to csv file."""
    degree_dict = dict(graph.degree())
    cc_dict = dict(nx.clustering(graph))
    data = {'degree': degree_dict, 'cc' : cc_dict}
    degree_cc = pd.DataFrame(data)
    degree_cc.to_csv(file_path)
    return

def calculate_correlation(file_path):
    """Return correlation between cc and degree."""
    degree_cc = pd.read_csv(file_path, index_col=0)
    return degree_cc.corr()['cc'].loc['degree']

## for loop to calculate cc and degree for all networks
# for name, network in networks.items():
#     write_cc_and_degree_to_csv(network, networks_with_degree_cc_file_path[name])

Plot Nashville Meetup network:

In [15]:
NashVilleMeetupNetwork_degree_cc_ = pd.read_csv('results/NashvilleMeetupNetwork_degree_vs_cc.csv', index_col=0)
fig = px.scatter(NashVilleMeetupNetwork_degree_cc_, x="degree", y="cc", title="Clustering coefficient VS degree for Nashville Meetup Network (Figure 6)")
fig.show()
NashVilleMeetupNetwork_degree_cc_.corr()
Out[15]:
degree cc
degree 1.000000 -0.645953
cc -0.645953 1.000000

Indeed, Nashville Meetup Network looks like a real network. A pearson correlation of -0.65 further substantiates this statement. Let's now look at the complete opposite, Github Social.

In [16]:
GithubSocial_degree_cc_ = pd.read_csv('results/GithubSocial_degree_vs_cc.csv', index_col=0)
fig = px.scatter(GithubSocial_degree_cc_,
                 x="degree",
                 y="cc",
                 title="Clustering coefficient VS degree for Github Social (Figure 7)",
                )
fig.show()
GithubSocial_degree_cc_.corr()
Out[16]:
degree cc
degree 1.000000 -0.046342
cc -0.046342 1.000000

Github Social on the other hand, does not have this real network property with a pearson correlation close to zero. It is unnecessary to plot this for every single network, to show the absence of this relationship. Therefore we construct a dataframe with all the pearson correlations.

In [17]:
cc_degree_corr = pd.DataFrame(index=networks.keys(), columns=['correlation'])
for name, file_path in networks_with_degree_cc_file_path.items():
    cc_degree_corr['correlation'].loc[name] = calculate_correlation(file_path)
cc_degree_corr
Out[17]:
correlation
NashvilleMeetupNetwork -0.645953
DeezerHR -0.029309
DeezerHU -0.059689
DeezerRO 0.001017
FacebookLargepage -0.017038
FeatherDeezerSocial -0.044112
FeatherLastfmSocial 0.042086
GithubSocial -0.046342
TwitchSocialNetworksDE -0.123471
TwitchSocialNetworksENGB -0.041845
TwitchSocialNetworksES -0.133733
TwitchSocialNetworksFR -0.161227
TwitchSocialNetworksPTBR -0.210791
TwitchSocialNetworksRU -0.063507
SocSignSlashdot090221 -0.027151
SocSlashdot0902 -0.027694

What clustering means for social interactions¶

This tendency to close triangles creates real networks with low distances, hubs, and high clustering. To better show how these interactions work, we look at the Barabasi-Albert model. This model was created to give insight into attachment which plays an important rol in real networks (Barabasi, 2016). In this model a network is created by adding new nodes and connecting them to a defined number of nodes that already exist in the network. This way of building networks corresponds more to triadic closure than networks created by the random network model. This dynamic also creates networks with hubs, which play a crucial rol in keeping the distance in networks small. As a concequence, the degree distribution looks more like one from a real network.

In [18]:
barabasi = nx.barabasi_albert_graph(11372, 100)
degree_sequence = [d for n, d in barabasi.degree()]
degree_count = nx.degree_histogram(barabasi)
degree_distribution = [count / sum(degree_count) for count in degree_count]

fig = go.Figure(data=go.Histogram(x=degree_sequence, y=degree_distribution))
fig.update_layout(
    title="Degree Distribution of Barabasi-Albert network (Figure 8)",
    xaxis_title="Degree",
    yaxis_title="Number of nodes with degree",
    bargap = 0.01,
)
fig.show()

7. Conclusion¶

In this data story, we explored the extent to which online social interactions are similar to real social interactions and the extent to which they differ. This was done by comparing different properties of the networks that emerge out of these different social interactions.

Through the analysis of degree distribution, we observed that the networks derived from social interactions differed from the binomial degree distribution of random networks, indicating that they are more likely real networks. Additionally, we calculated the average shortest path length and diameter to assess the small world property of the networks. While we couldn't conclusively determine if the networks were real or random based on this property alone, we noted that some networks exhibited shorter average path lengths, suggesting a higher degree of connectivity.

Furthermore, we explored the concept of triadic closure, where individuals tend to meet new friends through shared contacts. By examining the clustering coefficient, we found that real networks with a small world property tend to have a higher average clustering coefficient. This highlighted the interconnectedness of neighbors within the networks and supported the notion that social interactions in real networks often involve common acquaintances.

Based on the analysis, it can be concluded that Nashville Meetup Network is the only network that satisfies all the properties of a real network. The other networks in the analysis lacked high clustering coefficients and did not exhibit the same characteristics as real networks. The Nashville Meetup Network is the only network where the edges represent real world interactions and not interactions in a digital environment. This confirms our first perspective, which was that online social networks based on real life social interactions were more likely to resemble real networks. This perspective is confirmed because Nashville, our network which was based on real life social interaction, looked way more like a real network than our other networks that lacked those real life social interactions.

However, our data doesn't confirm our second perspective, which was that online social networks without real life social interaction were more likely to resemble random networks. This is because in some cases, our networks without real life social interaction did not resemble random networks.

These findings emphasize that humans are social creatures who thrive on real-world connections and shared experiences. In the digital realm, the dynamics of meeting new friends through shared contacts are not as prevalent. Online social networks are often characterized by a broader and more diverse range of connections, with less emphasis on close-knit communities.

8. Reflection¶

We received one point of feedback from our classmates who reviewed our draft, which was that we unnecessarily showed the index in Figure 3. We listened to the feedback and decided to remove the index, to leave more space for the graph.

Our TA gave us multiple points of feedback to work with.

  • The first point was to add perspectives to our project and support those perspectives with the datasets. We agreed that this was necessary and wrote two perspectives for the project.
  • The second point was to add an interactive visualizations. Our visualizations in the draft already had some interactive elements, such as the ability to remove certain networks from the visualization with the click of a button. We however decided to add a different type of interactive visualization with a slider, which was used to create the random networks.
  • Thirdly, our draft version was written too formal and not accessible to general public. We made some changes where we could with sufficient explanation to keep it understandable.

9. Work Distribution¶

Raúl¶

I took on most of the programming work, wrote different pieces of text. I also played an important role in searching many relevant concepts and ideas to incorporate into the project.

Elianto¶

I made some of the graphs, helped with the perspectives, wrote text for the introduction, wrote text for the conclusion, helped with the peer feedback and wrote the reflection. I also did multiple spelling checks while we were making this project.

Fardeen¶

I made a few graphs, helped with some text for the introduction, wrote initial text for the degree distribution and made other minor changes, such as a spelling check, and headlines for appendices.

Sharynho¶

I made the bubble chart visualization, helped with the fraction of nodes degree distribution, completed the dataset table, helped formulating the structure for perspectives and wrote about triadic closure and bigger network.

10. References¶

Barabasi, A. L. (2016). Network Science. Cambridge: Cambridge University Press. http://networksciencebook.com/

Menczer, F., Fortunato, S., & Davis, C. (2020). A First Course in Network Science. Cambridge: Cambridge University Press. doi:10.1017/9781108653947

Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of ‘small-world’ networks. Nature, 393(6684), 440–442. https://doi.org/10.1038/30918

11. Appendix¶

Link to online repository¶

https://github.com/raul-vh/groepsproject-invi

Links to the datasets used in this project¶

Nashville https://www.kaggle.com/datasets/stkbailey/nashville-meetup?select=rsvps.csv

Deezer https://www.kaggle.com/datasets/andreagarritano/deezer-social-networks

Social Graphs https://www.kaggle.com/datasets/wolfram77/graphs-social

In [ ]: